{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import faiss \n",
    "import torch\n",
    "import numpy as np\n",
    "from tqdm import trange\n",
    "from sentence_transformers import SentenceTransformer"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Load the Embedding Model\n",
    "### 加载embedding模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "embedding_model = \"./model/nomic-embed-text-v1\"\n",
    "embedding_model = SentenceTransformer(embedding_model, trust_remote_code=True)\n",
    "\n",
    "embedding_model.to(torch.device('cuda'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read the pre-prepared JSON file in the format required by TinyDB (refer to arxiv_paper_db.json for the specific format). Each index corresponds to a single paper.\n",
    "### 读取准备好符合tinydb要求json文件（具体格式参考arxiv_paper_db.json）, 每个索引index对应一篇paper"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "with open('./database/arxiv_paper_db.json','r') as f:\n",
    "    papers = json.loads(f.read())\n",
    "papers_l = list(papers['cs_paper_info'].items())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[('1',\n",
       "  {'id': '1811.06122v1',\n",
       "   'title': 'The case for shifting the Renyi Entropy',\n",
       "   'url': 'http://arxiv.org/pdf/1811.06122v1',\n",
       "   'date': '2018-11-14',\n",
       "   'abs': 'We introduce a variant of the R\\\\\\'enyi entropy definition that aligns it with\\nthe well-known H\\\\\"older mean: in the new formulation, the r-th order R\\\\\\'enyi\\nEntropy is the logarithm of the inverse of the r-th order H\\\\\"older mean. This\\nbrings about new insights into the relationship of the R\\\\\\'enyi entropy to\\nquantities close to it, like the information potential and the partition\\nfunction of statistical mechanics. We also provide expressions that allow us to\\ncalculate the R\\\\\\'enyi entropies from the Shannon cross-entropy and the escort\\nprobabilities. Finally, we discuss why shifting the R\\\\\\'enyi entropy is fruitful\\nin some applications.',\n",
       "   'cat': 'cs.IT',\n",
       "   'authors': ['Francisco José Valverde-Albacete', 'Carmen Peláez-Moreno']}),\n",
       " ('2',\n",
       "  {'id': '1811.06115v1',\n",
       "   'title': 'Deep Learning in the Wavelet Domain',\n",
       "   'url': 'http://arxiv.org/pdf/1811.06115v1',\n",
       "   'date': '2018-11-14',\n",
       "   'abs': \"This paper examines the possibility of, and the possible advantages to\\nlearning the filters of convolutional neural networks (CNNs) for image analysis\\nin the wavelet domain. We are stimulated by both Mallat's scattering transform\\nand the idea of filtering in the Fourier domain. It is important to explore new\\nspaces in which to learn, as these may provide inherent advantages that are not\\navailable in the pixel space. However, the scattering transform is limited by\\nits inability to learn in between scattering orders, and any Fourier domain\\nfiltering is limited by the large number of filter parameters needed to get\\nlocalized filters. Instead we consider filtering in the wavelet domain with\\nlearnable filters. The wavelet space allows us to have local, smooth filters\\nwith far fewer parameters, and learnability can give us flexibility. We present\\na novel layer which takes CNN activations into the wavelet space, learns\\nparameters and returns to the pixel space. This allows it to be easily dropped\\nin to any neural network without affecting the structure. As part of this work,\\nwe show how to pass gradients through a multirate system and give preliminary\\nresults.\",\n",
       "   'cat': 'cs.CV',\n",
       "   'authors': ['Fergal Cotter', 'Nick Kingsbury']}),\n",
       " ('3',\n",
       "  {'id': '1811.06114v2',\n",
       "   'title': 'Prophet Inequalities for I.I.D. Random Variables from an Unknown\\n  Distribution',\n",
       "   'url': 'http://arxiv.org/pdf/1811.06114v2',\n",
       "   'date': '2018-11-14',\n",
       "   'abs': 'A central object in optimal stopping theory is the single-choice prophet\\ninequality for independent, identically distributed random variables: Given a\\nsequence of random variables $X_1,\\\\dots,X_n$ drawn independently from a\\ndistribution $F$, the goal is to choose a stopping time $\\\\tau$ so as to\\nmaximize $\\\\alpha$ such that for all distributions $F$ we have\\n$\\\\mathbb{E}[X_\\\\tau] \\\\geq \\\\alpha \\\\cdot \\\\mathbb{E}[\\\\max_tX_t]$. What makes this\\nproblem challenging is that the decision whether $\\\\tau=t$ may only depend on\\nthe values of the random variables $X_1,\\\\dots,X_t$ and on the distribution $F$.\\nFor quite some time the best known bound for the problem was\\n$\\\\alpha\\\\geq1-1/e\\\\approx0.632$ [Hill and Kertz, 1982]. Only recently this bound\\nwas improved by Abolhassani et al. [2017], and a tight bound of\\n$\\\\alpha\\\\approx0.745$ was obtained by Correa et al. [2017]. The case where $F$\\nis unknown, such that the decision whether $\\\\tau=t$ may depend only on the\\nvalues of the first $t$ random variables but not on $F$, is equally well\\nmotivated (e.g., [Azar et al., 2014]) but has received much less attention. A\\nstraightforward guarantee for this case of $\\\\alpha\\\\geq1/e\\\\approx0.368$ can be\\nderived from the solution to the secretary problem. Our main result is that\\nthis bound is tight. Motivated by this impossibility result we investigate the\\ncase where the stopping time may additionally depend on a limited number of\\nsamples from~$F$. An extension of our main result shows that even with $o(n)$\\nsamples $\\\\alpha\\\\leq 1/e$, so that the interesting case is the one with\\n$\\\\Omega(n)$ samples. Here we show that $n$ samples allow for a significant\\nimprovement over the secretary problem, while $O(n^2)$ samples are equivalent\\nto knowledge of the distribution: specifically, with $n$ samples\\n$\\\\alpha\\\\geq1-1/e\\\\approx0.632$ and $\\\\alpha\\\\leq\\\\ln(2)\\\\approx0.693$, and with\\n$O(n^2)$ samples $\\\\alpha\\\\geq0.745-\\\\epsilon$ for any $\\\\epsilon>0$.',\n",
       "   'cat': 'cs.DS',\n",
       "   'authors': ['José R. Correa',\n",
       "    'Paul Dütting',\n",
       "    'Felix Fischer',\n",
       "    'Kevin Schewior']}),\n",
       " ('4',\n",
       "  {'id': '1811.06110v1',\n",
       "   'title': 'Layered Belief Propagation for Low-complexity Large MIMO Detection Based\\n  on Statistical Beams',\n",
       "   'url': 'http://arxiv.org/pdf/1811.06110v1',\n",
       "   'date': '2018-11-14',\n",
       "   'abs': 'This paper proposes a novel layered belief propagation (BP) detector with a\\nconcatenated structure of two different BP layers for low-complexity large\\nmulti-user multi-input multi-output (MU-MIMO) detection based on statistical\\nbeams. To reduce the computational burden and the circuit scale on the base\\nstation (BS) side, the two-stage signal processing consisting of slow varying\\nouter beamformer (OBF) and group-specific MU detection (MUD) for fast channel\\nvariations is effective. However, the dimensionality reduction of the\\nequivalent channel based on the OBF results in significant performance\\ndegradation in subsequent spatial filtering detection. To compensate for the\\ndrawback, the proposed layered BP detector, which is designed for improving the\\ndetection capability by suppressing the intra- and inter-group interference in\\nstages, is introduced as the post-stage processing of the OBF. Finally, we\\ndemonstrate the validity of our proposed method in terms of the bit error rate\\n(BER) performance and the computational complexity.',\n",
       "   'cat': 'cs.IT',\n",
       "   'authors': ['Takumi Takahashi',\n",
       "    'Antti Tölli',\n",
       "    'Shinsuke Ibi',\n",
       "    'Seiichi Sampei']}),\n",
       " ('5',\n",
       "  {'id': '1811.06109v1',\n",
       "   'title': 'Predictive Modeling with Delayed Information  a Case Study in E-commerce\\n  Transaction Fraud Control',\n",
       "   'url': 'http://arxiv.org/pdf/1811.06109v1',\n",
       "   'date': '2018-11-14',\n",
       "   'abs': 'In Business Intelligence, accurate predictive modeling is the key for\\nproviding adaptive decisions. We studied predictive modeling problems in this\\nresearch which was motivated by real-world cases that Microsoft data scientists\\nencountered while dealing with e-commerce transaction fraud control decisions\\nusing transaction streaming data in an uncertain probabilistic decision\\nenvironment. The values of most online transactions related features can return\\ninstantly, while the true fraud labels only return after a stochastic delay.\\nUsing partially mature data directly for predictive modeling in an uncertain\\nprobabilistic decision environment would lead to significant inaccuracy on risk\\ndecision-making. To improve accurate estimation of the probabilistic prediction\\nenvironment, which leads to more accurate predictive modeling, two frameworks,\\nCurrent Environment Inference (CEI) and Future Environment Inference (FEI), are\\nproposed. These frameworks generated decision environment related features\\nusing long-term fully mature and short-term partially mature data, and the\\nvalues of those features were estimated using varies of learning methods,\\nincluding linear regression, random forest, gradient boosted tree, artificial\\nneural network, and recurrent neural network. Performance tests were conducted\\nusing some e-commerce transaction data from Microsoft. Testing results\\nsuggested that proposed frameworks significantly improved the accuracy of\\ndecision environment estimation.',\n",
       "   'cat': 'cs.LG',\n",
       "   'authors': ['Junxuan Li',\n",
       "    'Yung-wen Liu',\n",
       "    'Yuting Jia',\n",
       "    'Yifei Ren',\n",
       "    'Jay Nanduri']}),\n",
       " ('6',\n",
       "  {'id': '1811.06106v3',\n",
       "   'title': 'Multivariate Time-series Similarity Assessment via Unsupervised\\n  Representation Learning and Stratified Locality Sensitive Hashing \\n  Application to Early Acute Hypotensive Episode Detection',\n",
       "   'url': 'http://arxiv.org/pdf/1811.06106v3',\n",
       "   'date': '2018-11-14',\n",
       "   'abs': 'Timely prediction of clinically critical events in Intensive Care Unit (ICU)\\nis important for improving care and survival rate. Most of the existing\\napproaches are based on the application of various classification methods on\\nexplicitly extracted statistical features from vital signals. In this work, we\\npropose to eliminate the high cost of engineering hand-crafted features from\\nmultivariate time-series of physiologic signals by learning their\\nrepresentation with a sequence-to-sequence auto-encoder. We then propose to\\nhash the learned representations to enable signal similarity assessment for the\\nprediction of critical events. We apply this methodological framework to\\npredict Acute Hypotensive Episodes (AHE) on a large and diverse dataset of\\nvital signal recordings. Experiments demonstrate the ability of the presented\\nframework in accurately predicting an upcoming AHE.',\n",
       "   'cat': 'cs.CV',\n",
       "   'authors': ['Jwala Dhamala',\n",
       "    'Emmanuel Azuh',\n",
       "    'Abdullah Al-Dujaili',\n",
       "    'Jonathan Rubin',\n",
       "    \"Una-May O'Reilly\"]}),\n",
       " ('7',\n",
       "  {'id': '1811.06541v1',\n",
       "   'title': 'Mathematical Modeling of Arterial Blood Pressure Using\\n  Photo-Plethysmography Signal in Breath-hold Maneuver',\n",
       "   'url': 'http://arxiv.org/pdf/1811.06541v1',\n",
       "   'date': '2018-11-14',\n",
       "   'abs': 'Recent research has shown that each apnea episode results in a significant\\nrise in the beat-to-beat blood pressure and by a drop to the pre-episode levels\\nwhen patient resumes normal breathing. While the physiological implications of\\nthese repetitive and significant oscillations are still unknown, it is of\\ninterest to quantify them. Since current array of instruments deployed for\\npolysomnography studies does not include beat-to-beat measurement of blood\\npressure, but includes oximetry, it is both of clinical interest to estimate\\nthe magnitude of BP oscillations from the photoplethysmography (PPG) signal\\nthat is readily available from sleep lab oximeters. We have investigated a new\\nmethod for continuous estimation of systolic (SBP), diastolic (DBP), and mean\\n(MBP) blood pressure waveforms from PPG. Peaks and troughs of PPG waveform are\\nused as input to a 5th order autoregressive moving average model to construct\\nestimates of SBP, DBP, and MBP waveforms. Since breath hold maneuvers are shown\\nto simulate apnea episodes faithfully, we evaluated the performance of the\\nproposed method in 7 subjects (4 F; 32+-4 yrs., BMI 24.57+-3.87 kg/m2) in\\nsupine position doing 5 breath maneuvers with 90s of normal breathing between\\nthem. The modeling error ranges were (all units are in mmHg) -0.88+-4.87 to\\n-2.19+-5.73 (SBP); 0.29+-2.39 to -0.97+-3.83 (DBP); and -0.42+-2.64 to\\n-1.17+-3.82 (MBP). The cross validation error ranges were 0.28+-6.45 to\\n-1.74+-6.55 (SBP); 0.09+-3.37 to -0.97+-3.67 (DBP); and 0.33+-4.34 to\\n-0.87+-4.42 (MBP). The level of estimation error in, as measured by the root\\nmean squared of the model residuals, was less than 7 mmHg',\n",
       "   'cat': 'eess.SP',\n",
       "   'authors': ['Armin Soltan Zadi',\n",
       "    'Raichel M. Alex',\n",
       "    'Rong Zhang',\n",
       "    'Donald E. Watenpaugh',\n",
       "    'Khosrow Behbehani']}),\n",
       " ('8',\n",
       "  {'id': '1811.06103v1',\n",
       "   'title': 'Deep Neural Networks based Modrec  Some Results with Inter-Symbol\\n  Interference and Adversarial Examples',\n",
       "   'url': 'http://arxiv.org/pdf/1811.06103v1',\n",
       "   'date': '2018-11-14',\n",
       "   'abs': 'Recent successes and advances in Deep Neural Networks (DNN) in machine vision\\nand Natural Language Processing (NLP) have motivated their use in traditional\\nsignal processing and communications systems. In this paper, we present results\\nof such applications to the problem of automatic modulation recognition.\\nVariations in wireless communication channels are represented by statistical\\nchannel models and their parameterization will increase with the advent of 5G.\\nIn this paper, we report effect of simple two path channel model on our naive\\ndeep neural network based implementation. We also report impact of adversarial\\nperturbation to the input signal.',\n",
       "   'cat': 'cs.LG',\n",
       "   'authors': ['S. Asim Ahmed',\n",
       "    'Subhashish Chakravarty',\n",
       "    'Michael Newhouse']}),\n",
       " ('9',\n",
       "  {'id': '1811.06100v1',\n",
       "   'title': 'Newton Methods for Convolutional Neural Networks',\n",
       "   'url': 'http://arxiv.org/pdf/1811.06100v1',\n",
       "   'date': '2018-11-14',\n",
       "   'abs': 'Deep learning involves a difficult non-convex optimization problem, which is\\noften solved by stochastic gradient (SG) methods. While SG is usually\\neffective, it may not be robust in some situations. Recently, Newton methods\\nhave been investigated as an alternative optimization technique, but nearly all\\nexisting studies consider only fully-connected feedforward neural networks.\\nThey do not investigate other types of networks such as Convolutional Neural\\nNetworks (CNN), which are more commonly used in deep-learning applications. One\\nreason is that Newton methods for CNN involve complicated operations, and so\\nfar no works have conducted a thorough investigation. In this work, we give\\ndetails of all building blocks including function, gradient, and Jacobian\\nevaluation, and Gauss-Newton matrix-vector products. These basic components are\\nvery important because with them further developments of Newton methods for CNN\\nbecome possible. We show that an efficient MATLAB implementation can be done in\\njust several hundred lines of code and demonstrate that the Newton method gives\\ncompetitive test accuracy.',\n",
       "   'cat': 'stat.ML',\n",
       "   'authors': ['Chien-Chih Wang', 'Kent Loong Tan', 'Chih-Jen Lin']}),\n",
       " ('10',\n",
       "  {'id': '1811.06846v1',\n",
       "   'title': 'Improving Fingerprint Pore Detection with a Small FCN',\n",
       "   'url': 'http://arxiv.org/pdf/1811.06846v1',\n",
       "   'date': '2018-11-14',\n",
       "   'abs': 'In this work, we investigate if previously proposed CNNs for fingerprint pore\\ndetection overestimate the number of required model parameters for this task.\\nWe show that this is indeed the case by proposing a fully convolutional neural\\nnetwork that has significantly fewer parameters. We evaluate this model using a\\nrigorous and reproducible protocol, which was, prior to our work, not available\\nto the community. Using our protocol, we show that the proposed model, when\\ncombined with post-processing, performs better than previous methods, albeit\\nbeing much more efficient. All our code is available at\\nhttps://github.com/gdahia/fingerprint-pore-detection',\n",
       "   'cat': 'cs.CV',\n",
       "   'authors': ['Gabriel Dahia', 'Maurício Pamplona Segundo']})]"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "papers_l[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get embeddings of abs and title\n",
    "## 对title和abs做embedding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "def get_embeddings(text_l, batch_size=32):\n",
    "    res = []\n",
    "    for i in trange(0, len(text_l), batch_size):\n",
    "        batch_text = ['search_document: ' + _ for _ in text_l[i:i+batch_size]]\n",
    "        res.append(embedding_model.encode(batch_text))\n",
    "    return np.concatenate(res,axis=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "title_l = [paper[1]['title'] for paper in papers_l]\n",
    "abs_l = [paper[1]['abs'] for paper in papers_l]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "100%|██████████| 16803/16803 [09:33<00:00, 29.31it/s]\n"
     ]
    }
   ],
   "source": [
    "title_embeddings = get_embeddings(title_l)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "abs_embeddings = get_embeddings(abs_l)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Convert embeddings into faiss-index\n",
    "### 将向量储存为faiss index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "title_index = faiss.IndexFlatL2(title_embeddings.shape[1])\n",
    "title_index.add(title_embeddings)\n",
    "\n",
    "abs_index = faiss.IndexFlatL2(abs_embeddings.shape[1])\n",
    "abs_index.add(abs_embeddings)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save faiss-index, replacing the .bin file in the database folder.\n",
    "### 向量保存到本地，替换掉database文件夹中的.bin文件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "faiss.write_index(faiss.index_gpu_to_cpu(title_index), 'titles.index')\n",
    "\n",
    "faiss.write_index(faiss.index_gpu_to_cpu(abs_index), 'abstracts.index')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Save the mapping from paper ID to index locally, replacing the arxivid_to_index_abs.json file.\n",
    "### 将paper id到索引的映射保存到本地，替换掉 arxivid_to_index_abs.json文件"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "paperid_2_index = {}\n",
    "for paper in papers_l:\n",
    "    paper_id = paper[1]['id']\n",
    "    index = paper[0]\n",
    "    paperid_2_index[paper_id] = int(index)\n",
    "with open('./paperid_to_index.json', 'w') as f:\n",
    "    json.dump(paperid_2_index, f, indent=4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Modify the file paths in the __init__ fuction within src/database.py.\n",
    "### 对src/database.py中的__init__部分的初始化文件路径做相应的修改"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "base",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
